Introduction


Are you completely confused about predictive coding and
how the technology can be used in eDiscovery, or are 
you a predictive coding expert hoping to learn about cutting- 
edge developments in the area? Either way, this book is perfect 
for you. 


Predictive coding technology is a new approach to attorney 
document review that can be used to help legal teams 
significantly reduce the time and cost of eDiscovery. Despite 
the promise of predictive coding technology, the technology 
is relatively new to the legal field, and significant confusion 
about the proper use of these tools is pervasive. This book 
helps eliminate that confusion by providing a wealth of 
information about predictive coding technology, related 
terminology, and the proper use of these tools. 


About This Book


Predictive Coding For Dummies, Symantec Special Edition, 
shows you what predictive coding is, how it works, and 
when to use it. This book also helps you understand how to 
choose the correct predictive coding solution to meet your 
organization’s needs and introduces new information about 
the evolution of the technology. 


How This Book Is Organized 


This book is divided into eight chapters. 


Chapter 1 shows you the basics of eDiscovery and predictive 
coding, and introduces the Electronic Discovery Reference 
Model (EDRM). 


Chapter 2 explains the difference between predictive coding
and other technology-assisted review (TAR) tools as well as 
how these tools should be used together. 


Chapter 3 introduces information about the basic terminology 
and workflow for using predictive coding tools properly. 


Chapter 4 shares information detailing the many benefits your 
organization can gain from using predictive coding. 


Chapter 5 introduces several different approaches for using 
predictive coding on actual cases. 


Chapter 6 discusses the challenges related to early-generation 
predictive coding tools. 


Chapter 7 helps you understand how to choose the proper 
predictive coding tool to meet your needs. 


Chapter 8 provides ten important facts you need to know 
about predictive coding. 


Icons Used in This Book 


This book uses the following icons to call your attention to 
information you may find helpful. 


The information marked by this icon is important and
therefore repeated for emphasis. This way, you can easily
spot noteworthy information when you refer to the book later.


This icon marks places where technical matters, such as 
predictive coding jargon and legal terminology, are discussed. 


This icon points out extra-helpful information. 


Paragraphs marked with the Warning icon call attention to 
common pitfalls to avoid. 


Chapter 1


A Quick Overview 
of eDiscovery and 
Predictive Coding 


In This Chapter 
Comprehending eDiscovery 


Examining the Electronic Discovery Reference Model 


Talk to almost any organization about legal issues and
invariably the subject of eDiscovery comes up as a thorny 
pain point. These discussions commonly focus on the high 
costs of eDiscovery related to document review. This chapter 
provides an overview of eDiscovery and describes the 
different stages of the eDiscovery process. 


The chapter also introduces the concept of predictive coding 
and explains how it can help address many of the costs and 
burdens commonly associated with eDiscovery when used in 
conjunction with other eDiscovery tools. 


Understanding eDiscovery 


eDiscovery refers to the formal legal process whereby parties
to a lawsuit exchange electronically stored information
(ESI) in order to evaluate the merits of a case. Traditionally
referred to as discovery, most people now refer to the process
as eDiscovery since ESI is the principal form of information
exchanged in litigation.


In the United States federal court system, the Federal Rules of
Civil Procedure outline the rules parties must follow during 
discovery. Similarly, all states have their own version of these 
rules that are applicable to lawsuits filed within their respective 
court systems. 


Although eDiscovery is technically a term that applies to 
parties involved in litigation, the term is often used more 
broadly to refer to other situations in which parties are 
required to turn over information. Here are a few examples: 


Internal investigations 
Government inquiries or investigations 
Freedom of Information Act requests 


State public record requests 


For purposes of this book, the term “eDiscovery” is used 
broadly to apply to situations in which a party is required to 
turn over electronic information as part of an investigation or 
legal obligation. 


Finally, eDiscovery is not a concept limited to the United 
States. Many countries have developed formal rules to 
address the exchange of ESI in litigation. The list includes 
Australia, Canada, New Zealand, Singapore, and the United 
Kingdom (England and Wales). eDiscovery also applies to 
international regulatory inquiries and in the context of 
cross-border data protection laws. eDiscovery is now a 
universal principle applicable to organizations around
the globe.


The Electronic Discovery 
Reference Model 


The Electronic Discovery Reference Model (EDRM) is a model 
that is commonly used to depict each stage of the eDiscovery 
process (see Figure 1-1). 


Although each stage can be expensive, ESI review (ESI, 
document, and file review are used interchangeably throughout 
this book) is normally considered the most expensive part
of the eDiscovery process. The high costs are due to the
expense associated with paying legal teams to manually
review and segregate documents that are responsive (related) 
to issues in a case from those that are nonresponsive 
(unrelated). The document review process is typically 
triggered by what is known in eDiscovery as a “request for 
production of documents.” Generally, a request for docu- 
ments about issues in the case by one party (requesting 
party) requires the other party (responding party) to identify 
and produce the responsive documents that aren’t privileged 
(legally protected from disclosure). There are generally
multiple requests between parties that typically involve the 
exchange of millions of documents. Not surprisingly, the cost 
of document review in eDiscovery can be extremely high. 


[Figure: Electronic Discovery Reference Model. Recoverable stage labels: Information Management, Preservation, Collection, Production. Recoverable axis labels: Volume, Relevance.]


Figure 1-1: The EDRM shows the stages of the eDiscovery process. 


Despite the high cost of manually reviewing documents for
responsiveness, most organizations spend time and money 
segregating responsive from nonresponsive documents in 
order to avoid inadvertently producing sensitive, confidential, 
or legally privileged information to requesting parties. Many 
different kinds of privilege can be asserted as the legal basis 
for withholding production of responsive documents, but a 
detailed list of those privileges is beyond the scope of this 
book. However, examples of commonly withheld documents 
include communications between attorneys and clients and 
documents containing attorney “work product” information. 
For purposes of simplicity, responsive ESI that a party 
wishes to withhold from production based on legal privilege, 
confidentiality, or other valid grounds will be generically 
referred to as “privileged” throughout this book. 


For example, a 2012 RAND Corporation study estimated that it
costs organizations approximately $18,000 to review a single 
gigabyte of data. Review costs can quickly reach astronomical 
proportions considering that large organizations often deal 
with hundreds of cases, and many cases routinely involve 
hundreds of gigabytes of data or more. 


The number of files per gigabyte of data varies dramatically 
depending on the type of file. For example, a single gigabyte 
could include more than 50,000 e-mail messages without 
attachments. On the other hand, a gigabyte of music files may 
include closer to 400 or 500 files. 


The expense associated with document review continues to 
be exacerbated by tremendous worldwide data growth. For 
example, according to International Data Corporation (IDC), 
the amount of digital information created in 2010 was 1.2 
zettabytes. A zettabyte is equal to approximately 1,000 exabytes. 
Five exabytes is believed by some to be roughly equal to 
every spoken word ever uttered by mankind. 


The dramatic growth in worldwide information not only 
increases eDiscovery costs by requiring the review of more 
ESI, but information growth also increases organizational risk. 
Those risks include a higher likelihood of overlooking 
responsive documents that should have been produced as 
well as the risk of missing important deadlines. Each of these 
problems might lead to court-ordered sanctions (penalties) 
that could cost an organization more money and result in 
harm to its reputation. Similarly, the risk of inadvertently 
disclosing confidential information is increased because more 
information must be produced within limited timeframes. 


Judges have broad authority to issue a wide array of sanctions. 
Examples of common sanctions include monetary penalties, 
adverse instructions to the jury during trial, and possibly even 
dismissal of the case in extreme situations. A comprehensive 
discussion of sanctions is beyond the scope of this book. 


Many believe that predictive coding technology is the answer 
to several of the eDiscovery challenges facing so many 
organizations today. Chapter 2 introduces the basics of 
predictive coding technology and helps explain how this 
technology can help dramatically reduce the cost, risk, and 
time associated with traditional manual ESI review while also 
improving review accuracy. 


Chapter 2
Predictive Coding Defined 


In This Chapter 


Delving into predictive coding 
Looking at predictive coding and other technology tools 


Predictive coding technology began gaining true momentum
as an alternative approach to manual document review 
around 2010. Although machine learning (the underlying 
technology behind predictive coding) has existed for decades, 
the technology is relatively new to the legal profession. This 
newness has resulted in some confusion. For example, 
predictive coding may be interpreted differently by different 
people. Additionally, it is often referred to by various names, 
including computer-assisted review, technology-assisted 
review, and intelligent review, to name a few. 


This chapter helps clarify common questions about predictive 
coding and explains the difference between predictive coding 
and other types of technology-assisted review (TAR) tools. 


Explaining Predictive Coding 


Predictive coding is a type of machine-learning technology
that enables a computer to help “predict” how documents
should be classified based on limited human input. The 
technology is exciting for organizations attempting to manage 
skyrocketing legal budgets because the ability to automatically 
predict document responsiveness has the potential to save 
organizations millions in document review costs. The savings 
are mainly attributable to the fact that fewer dollars are spent 
paying lawyers to review and segregate responsive from nonre- 
sponsive documents when responding to discovery requests. 


Instead of paying lawyers and legal teams to review and code
large numbers of potentially responsive documents, predictive
coding technology allows a fraction of the documents
to be reviewed by humans and results in a fraction of the
review costs. The process entails automatically feeding decisions
made by attorneys about the responsiveness of a small
number of case documents, called a training set, into a computer
system. The computer relies on these training decisions
to create a model that automatically generates a prediction
score for every document based on the document’s degree of
responsiveness. This information can be used to rank, analyze,
and review the documents quickly and efficiently.


Coding or tagging refers to designating a particular classification 
to a document or group of documents. Documents are fre- 
quently coded with multiple designations that relate to 
various issues in the case during eDiscovery. However, for 
purposes of this book, the main coding designations discussed 
pertain to whether or not a document is responsive or 
nonresponsive to a request for production of documents. 


Training the predictive coding system is an iterative process 
that requires attorneys and their legal teams to evaluate the 
accuracy of the computer’s document prediction scores.
If the accuracy of the computer-generated predictions is
insufficient, additional training set documents are selected 
from the document population being considered. Multiple 
training sets are reviewed and coded until the required 
performance levels are achieved. Once the desired perfor- 
mance levels are achieved, decisions can be made about 
which documents to produce. 


For example, if the legal team’s analysis of the computer’s 
predictions reveals that within a population of 1 million 
documents, only those with prediction scores in the 70 percent 
range and higher appear to be responsive, the team may elect 
to produce only those 300,000 documents to the requesting 
party. The financial consequences of this approach are 
significant because a majority of the documents can be 
excluded from expensive manual review by humans (see 
Chapter 5 for more information about different approaches 
for managing final document productions). 
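
As a rough illustration of the score cutoff described in this example, the following Python sketch keeps only the documents whose prediction score meets a chosen threshold. The function name, the document IDs, and the 0-to-1 score scale are assumptions made for illustration; real predictive coding tools expose scores in their own ways.

```python
def select_for_production(scores, threshold=0.70):
    """Return document IDs whose prediction score is at or above the threshold.

    scores: dict mapping a document ID to a prediction score between 0 and 1
            (a hypothetical format chosen for this sketch).
    The result is ordered from highest score to lowest.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, score in ranked if score >= threshold]


# Hypothetical scores for four documents.
example_scores = {"DOC-001": 0.92, "DOC-002": 0.70, "DOC-003": 0.69, "DOC-004": 0.08}
selected = select_for_production(example_scores)
excluded = len(example_scores) - len(selected)
print(f"{len(selected)} selected, {excluded} excluded from manual review")
```

The threshold itself is a judgment the legal team makes after analyzing the predictions; the sketch only applies whatever cutoff has been chosen.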


The terms manual review and human review are used inter- 
changeably throughout this book and have the same meaning. 


The fewer documents requiring human review, the more
money saved.


Distinguishing Other TAR Tools 


Predictive coding technology is often confused with other 
types of TAR tools, such as concept searching and clustering 
technology (see definitions later in this section). However, 
unlike TAR tools that automatically extract patterns and 
identify relationships between documents with minimal 
human intervention, predictive coding requires a deeper level 
of human interaction. This interaction involves heavy reliance 
on humans to train and fine-tune the system through an 
iterative process and is often referred to as a type of super- 
vised learning. 


Some of the TAR tools used in eDiscovery that do not include 
this level of interaction are described as follows: 


Keyword search: Involves inputting a word or words 
into a computer which then retrieves documents within 
the collection containing the same words. Also known as 
Boolean searching, keyword search tools typically include 
enhanced capabilities to identify word combinations and 
derivatives of root words among other things. 


Concept search: Involves the use of algorithms to deter- 
mine whether a document is responsive to a particular 
search query. The technology typically analyzes variables 
such as the proximity and frequency of words as they 
appear in relationship to a keyword search. The technol- 
ogy can retrieve more documents than keyword searches 
because conceptually related documents are identified, 
whether or not those documents contain the original key- 
word search terms. 


Discussion threading: Utilizes algorithms to dynamically 
link together related documents (most commonly e-mail 
messages) into chronological threads that reveal entire 
discussions. This simplifies the process of identifying 
participants to a conversation and understanding the 
substance of the conversation. 


Clustering: Involves the use of algorithms that automati- 
cally organize a large collection of documents into differ- 
ent topical groupings based on similarity. 


Find similar: Enables the automated retrieval of other
documents related to a particular document of interest. 
Reviewing similar documents together accelerates the 
review process, provides full context for the document 
under review, and ensures greater coding consistency. 


Near-duplicate identification: Allows reviewers to easily
identify, view, and code near-duplicate e-mails, attach- 
ments, and loose files. Some systems can highlight differ- 
ences between near-duplicate documents to help simplify 
document review. 


Hype and confusion surrounding the promise of predictive
coding technology have led some to suggest that this new
approach may render other TAR tools obsolete. To the 
contrary, predictive coding technology should be viewed as 
one of many different types of tools in the litigator’s toolbelt. 
As described in more detail in Chapter 3, these tools are often 
used together to achieve the greatest efficiencies. 


Selecting an eDiscovery platform that includes a comprehensive 
set of TAR tools provides increased flexibility for addressing a 
wide variety of matters. 


For a more detailed description of machine learning technology, 
see Knowledge Discovery with Support Vector Machines by Lutz 
H. Hamel (Wiley) and Machine Learning in Action by Peter 
Harrington (Manning Publications). 


These materials are the copyright of John Wiley & Sons, Inc. and any 
dissemination, distribution, or unauthorized use is strictly prohibited. 


Chapter 3 


Basic Predictive Coding 
Terminology and Workflow 


In This Chapter 
Learning key terms 
Understanding the basic steps 


Understanding how to use predictive coding tools
properly is critical for several reasons. First, predictive
coding is relatively new to the legal field and introduces
additional complexity to the eDiscovery process. Second, 
many different predictive coding solutions are available on 
the market that vary in quality and approach. Third, even 
though predictive coding solutions can be difficult to use, 
clear instructions and training are often lacking, which can 
increase the risk of error. These and other factors have 
combined to create confusion about the proper methodology 
for using predictive coding tools. 


This chapter helps address the confusion surrounding various 
predictive coding methodologies by providing an overview
of common predictive coding terminology and by describing
a sample predictive coding workflow. You may not need to 
know the terms contained in the first section to use predictive 
coding tools effectively. However, the providers of these tools 
should be able to understand and explain how their products 
apply these concepts as part of their recommended workflow 
to ensure that you are comfortable with the defensibility of 
your process and technology solution. 


Understanding Key Terminology


Understanding common predictive coding terminology is an
important step toward understanding the overall predictive
coding process. The definitions provided in this section are
explained in the context of a common eDiscovery scenario in
which the objective is to find responsive documents within
a larger population of documents. Typically, responsive and
nonresponsive documents are mixed together when they are
initially collected from within an organization as part of a case.
Therefore, prior to producing ESI to a requesting party, the
responsive documents are normally segregated from the
nonresponsive documents so that only the responsive (and
nonprivileged) documents are produced to the requesting party.


For purposes of understanding the terms defined here, 
assume there is a legal matter in which exactly 200,000 truly 
responsive documents exist within a population of 1 million 
documents. Also assume that a team of document review- 
ers determines that 300,000 of the 1 million documents are 
responsive. Finally, assume that of the 300,000 documents 
identified as responsive by the reviewers, only 150,000 of the 
documents they identified are truly responsive, meaning that 
50,000 responsive documents were overlooked and 150,000 
were incorrectly coded as responsive. 


Yield: Refers to the proportion of documents within a
defined document population that meet a certain criterion.
For example, if 200,000 out of 1 million documents
are truly responsive, the yield (also referred to as the
prevalence of responsive documents) is 20 percent
(200,000/1,000,000 = 20%).


Sample: Refers to the selection of a subset of documents 
within a larger document population to estimate the 
characteristics of the entire population. For example, a 
statistically valid random sample could be drawn from 
a population of 1 million documents to estimate the pro- 
portion of responsive documents (yield) within the larger 
population. If 20 percent of the sampled documents are 
responsive, then one could estimate that 20 percent or 
200,000 documents within the population of 1 million are 
responsive (.20 x 1,000,000 = 200,000). 


Margin of error: Refers to the maximum likely difference
between a true population value and a sample estimate
of that value. For example, assume a random sample is
used to estimate that 20 percent of documents within a
population of 1 million are responsive. If the margin of
error for the estimate is +/- 5 percent, then it is likely that
somewhere between 15 and 25 percent of the population is
responsive.


Confidence interval: Refers to a range of values computed 
from a sample that likely contains the true population 
value. Typically, the lower limit of the confidence interval 
is the sample estimate minus the margin of error, while the 
upper limit is the estimate plus the margin of error. 


Confidence level: Refers to the likelihood that the true
population value falls within the confidence interval (or 
the likelihood that the difference between the estimated 
population value and the true population value is less 
than the margin of error). For example, assume a random 
sample is used to estimate that 20 percent of the docu- 
ments within a population of 1 million are responsive. If 
the margin of error for the estimate is +/- 5 percent and the 
confidence level of the estimate is 95 percent, then there is 
95 percent confidence that between 15 and 25 percent of 
the documents in the population are responsive. 


Control set: Refers to a document sample used as a 
baseline for comparing and measuring test results. For 
example, a subset of documents can be selected from a 
larger document population and reviewed by experienced 
human reviewers to determine responsiveness as accu- 
rately as possible. A predictive coding tool’s predictions 
regarding the responsiveness of those same documents 
can then be compared to the control set to measure the 
tool’s performance. 


Training set: Refers to a subset of documents used to 
train the predictive coding system. Training sets are 
reviewed and coded by humans. The system then relies 
on information about how the training sets were coded 
to predict how other documents within the population 
should be coded. The initial training set is also referred 
to as the “seed” set. 


Recall: Refers to the proportion or percentage of truly
responsive documents within a defined
document population that are identified as responsive.
In the earlier example, since only 150,000 of the 200,000
truly responsive documents were identified, recall is
75 percent (150,000/200,000 = 75%). In other words,
recall is a measure of completeness.


Precision: Refers to the proportion or percentage of 
documents identified within a defined document 
population that are truly responsive. In the earlier 
example, since only 150,000 of the 300,000 identified 
documents are truly responsive, precision is 50 percent 
(150,000/300,000 = 50%). In other words, precision is a 
measure of exactness. 


F-measure: Refers to the balance or “harmonic mean”
between precision and recall. In the earlier example, the 
f-measure is 60 percent (2 x (75 x 50)/(75 + 50) = 60%). 
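
The arithmetic behind the yield, recall, precision, and f-measure definitions can be checked with a few lines of Python. This is only a sketch that reuses the numbers from the example scenario (1 million documents, 200,000 truly responsive, 300,000 identified by reviewers, 150,000 of those truly responsive); the function and variable names are illustrative and do not come from any particular tool.

```python
def review_metrics(truly_responsive, identified, correctly_identified):
    """Compute recall, precision, and f-measure as fractions between 0 and 1.

    truly_responsive:     documents in the population that are truly responsive
    identified:           documents the reviewers coded as responsive
    correctly_identified: identified documents that are truly responsive
    """
    recall = correctly_identified / truly_responsive
    precision = correctly_identified / identified
    # The f-measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure


population = 1_000_000
truly_responsive = 200_000
identified = 300_000
correctly_identified = 150_000

yield_rate = truly_responsive / population
recall, precision, f_measure = review_metrics(
    truly_responsive, identified, correctly_identified
)

print(f"yield:     {yield_rate:.0%}")
print(f"recall:    {recall:.0%}")
print(f"precision: {precision:.0%}")
print(f"f-measure: {f_measure:.0%}")
```

The same scenario also accounts for the two kinds of error: 200,000 - 150,000 = 50,000 responsive documents were overlooked, and 300,000 - 150,000 = 150,000 documents were incorrectly coded as responsive.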


Comprehending the 
Basic Workflow 


Predictive coding workflows are extremely important, but 
even basic workflow recommendations seem to vary depending 
on the provider. Sometimes different workflows might be 
applied depending on the situation at hand or the user’s 
objectives (see Chapter 5 for a more detailed description of 
predictive coding approaches). Unfortunately, defining and 
executing a defensible predictive coding workflow can be com- 
plicated. The application of flawed workflow methodologies is 
a critical problem to avoid since no technology tool can pro- 
duce accurate results if the tool is not used properly. 


Workflow mistakes are common in part because
predictive coding technology is relatively new to
the legal field. The lure of financial opportunity has resulted
in many companies racing to market with new technology 
offerings in order to capitalize on the legal community’s 
interest in predictive coding. Sometimes these tools are 
lacking in quality and sophistication. In other cases, the 
product is sound, but company representatives may not 
know how to use the tools properly and could inadvertently 
misinform customers and prospects about product capabilities 
and workflows during routine briefings and sales calls. 


Many believe that these and other factors have combined
to result in the widespread dissemination of misinformation
about predictive coding that has increased confusion about 
how to use these tools properly. The following predictive 
coding workflow is not completely comprehensive or the 
only approach. However, the workflow outlined here is an 
approach to using predictive coding technology properly that 
can be modified to address a number of different use cases. 


Step 1: Culling the junk 


Culling the junk eliminates documents that are clearly not 
responsive or relevant before a predictive coding tool is even 
used. This step is important because the licensing structure for 
many predictive coding tools requires customers to pay higher 
fees to process files. Rather than incurring these
unnecessary expenses, clearly nonresponsive files should be
culled. Similarly, eliminating documents that are clearly non- 
responsive reduces the number of documents requiring down- 
stream processing and review. Culling the junk before beginning 
the predictive coding process can save time and money. 


Good predictive coding tools should be part of an eDiscovery 
platform that includes culling and technology-assisted review 
(TAR) tools that can be used together seamlessly. For example, 
the ability to cull nonresponsive files by date, file type, 
person, domain, and other parameters and then transfer the 
remaining documents to the eDiscovery platform’s predictive 
coding module should be easy. See Chapter 2 for more infor- 
mation about various TAR tools. 
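
Culling by date, file type, and domain amounts to filtering on document metadata. The sketch below shows the idea in Python under simplifying assumptions: each document is a dictionary with hypothetical `date`, `file_type`, and `sender_domain` fields, and the exclusion lists are invented for the example. An eDiscovery platform would apply equivalent filters through its own interface.

```python
from datetime import date


def cull(documents, start, end, excluded_types=(), excluded_domains=()):
    """Keep documents inside the date range that are not of an excluded
    file type and were not sent from an excluded domain."""
    kept = []
    for doc in documents:
        in_range = start <= doc["date"] <= end
        bad_type = doc["file_type"] in excluded_types
        bad_domain = doc["sender_domain"] in excluded_domains
        if in_range and not bad_type and not bad_domain:
            kept.append(doc)
    return kept


# Hypothetical collection of three files.
collection = [
    {"id": 1, "date": date(2011, 3, 14), "file_type": "msg", "sender_domain": "example.com"},
    {"id": 2, "date": date(2011, 4, 2), "file_type": "mp3", "sender_domain": "example.com"},
    {"id": 3, "date": date(2009, 1, 5), "file_type": "msg", "sender_domain": "example.com"},
]

remaining = cull(
    collection,
    start=date(2010, 1, 1),
    end=date(2012, 12, 31),
    excluded_types={"mp3", "exe"},
    excluded_domains={"newsletter.example.org"},
)
print([doc["id"] for doc in remaining])
```

Filters this blunt can also remove responsive files, so the criteria deserve careful review before anything is discarded.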


Identifying and removing privileged documents from the 
document population in order to minimize the risk of 
inadvertently producing privileged documents is also a common 
approach at this stage. In the United States, Federal Rule of
Evidence 502 establishes rules allowing parties to retrieve
from other parties inadvertently disclosed electronically stored
information (ESI) that is privileged. Agreements between
parties regarding inadvertent ESI disclosure are often referred 
to as clawback agreements. Many lawyers believe that ESI 
should still be thoroughly evaluated prior to production 
despite the existence of a clawback agreement since revealing 
privileged information to opponents could result in negative 
consequences even if the information is returned (see Chapter 6 
for information on waiver and defensibility). 


Be careful to exercise proper judgment when culling documents
in order to minimize the risk of eliminating responsive files 
that should have been produced. 


Step 2: Estimating the yield 


The purpose of this step is to estimate the yield, or prevalence, 
of responsive documents contained within the overall docu- 
ment population after the junk has been culled out. 


Estimating the yield begins by selecting and manually reviewing 
a statistically valid random document sample from the 
population. Some predictive coding tools are able to both 
calculate and randomly select a statistically valid number of 
documents for review automatically. 
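
For readers curious about what "statistically valid" can mean in practice, the following Python sketch uses a common textbook formula for sizing a simple random sample when estimating a proportion, with a finite population correction. This is an assumption about one reasonable approach, not a description of how any particular predictive coding tool performs the calculation; the function names and default values are illustrative.

```python
import math
from statistics import NormalDist


def sample_size(population, margin_of_error=0.05, confidence_level=0.95,
                expected_proportion=0.5):
    """Estimate how many documents to sample at random.

    expected_proportion=0.5 is the most conservative guess and gives
    the largest sample size.
    """
    # z-score for a two-sided interval at the given confidence level
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    p = expected_proportion
    n0 = z ** 2 * p * (1 - p) / margin_of_error ** 2
    # Finite population correction
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


def margin_of_error(proportion, sample_count, confidence_level=0.95):
    """Approximate margin of error for a proportion seen in a random sample."""
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    return z * math.sqrt(proportion * (1 - proportion) / sample_count)


n = sample_size(1_000_000)
print(f"documents to sample: {n}")
print(f"margin of error at 20% observed yield: {margin_of_error(0.20, n):.3f}")
```

Tightening the margin of error or raising the confidence level increases the number of documents that must be reviewed, which is the trade-off the legal team weighs at this step.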


If review of the initial random sample reveals that the estimated 
number of responsive documents within the population is low 
(low yield), then the size of the control set may need to be 
adjusted as explained in Step 3 to ensure system performance 
is measured correctly. 


Step 3: Selecting and reviewing 
the control set 


The next step is to select and review the proper number of 
documents to be included within the control set. The control 
set is used to help measure the predictive coding tool’s 
performance. The size of the control set depends on the 
system performance levels desired by the user as well as 
other factors. The calculation is critical because failing to 
select a sufficient number of documents for inclusion in the 
control set could result in high margins of error when the tool 
is used to make document predictions. 


Performance levels are typically measured by estimating 
recall, precision, and f-measure as described earlier. Ideally, 
the predictive coding tool utilized will be able to automatically 
calculate these measurements to avoid the need to calculate 
them manually outside of the system for every case. 


If the estimated population yield is high, the random
sample selected in Step 2 can serve as the control set without
resulting in an excessive margin of error. On the other hand, if
the yield is low, meaning that the estimated number of responsive
documents within the population is low, additional calculations
should be conducted.


These calculations are beyond the scope of this book, but 
generally they should determine the proper number of 
documents to be included in the control set in order to 
achieve the desired performance levels. Once the correct 
number of documents is determined, they can be randomly 
selected from the remaining document population and 
combined with the initial random sample (from Step 2) to 
form the control set. As described earlier, the documents in 
the control set can now be manually reviewed and coded for 
responsiveness by human reviewers. 
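
To give a feel for why low yield calls for a larger control set, here is a simplified Python sketch. It rests on one assumption that is not taken from this book or from any vendor: recall is estimated only from the responsive documents in the control set, so the control set must contain enough of them to reach the desired margin of error. The function name and defaults are hypothetical, and a real calculation may involve additional variables.

```python
import math
from statistics import NormalDist


def control_set_size(estimated_yield, margin_of_error=0.05,
                     confidence_level=0.95, expected_recall=0.5):
    """Rough control set size needed to estimate recall.

    Step 1: find how many responsive documents are needed to estimate
            recall within the margin of error (0.5 is the most
            conservative guess for expected_recall).
    Step 2: divide by the estimated yield, because only that fraction
            of randomly selected documents will be responsive.
    """
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    responsive_needed = (
        z ** 2 * expected_recall * (1 - expected_recall) / margin_of_error ** 2
    )
    return math.ceil(responsive_needed / estimated_yield)


for y in (0.20, 0.05, 0.01):
    print(f"yield {y:.0%}: about {control_set_size(y):,} control documents")
```

Under these assumptions the required control set grows in inverse proportion to the yield, which is why a sample that works at 20 percent yield can be far too small at 1 percent.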


Few tools have the ability to automate the calculation of a
properly sized control set, and the importance of this step is
commonly overlooked. If the control set is not properly sized,
there is a high risk that measurements of the system’s performance
will be inaccurate. This can result in unintentional misrepresentations to
the court and opposing parties about the quality and 
thoroughness of the document review and production process. 
To find additional information addressing the underlying 
science and variables behind creating a properly sized 
control set, read “Predictive Coding Measurement Challenges” 
at http://www.clearwellsystems.com/e-discovery-blog/2012/07/06/predictive-coding-measurement-challenges-electronic-discovery/


Step 4: Training the system 


After the control set is manually reviewed, a small number 
of documents called a training set must also be separately 
selected, reviewed, and coded by humans to begin training 
the predictive coding system. 


Documents to be included in the training set normally are not
randomly selected. Instead, responsive documents (most sys- 
tems also recommend adding nonresponsive documents) from 
the population are targeted for inclusion in the training set. This 
is often done using keyword searches and other technology- 
assisted review (TAR) tools. This approach is commonly 
referred to as judgmental sampling since the user’s judgment is 
leveraged to select representative training documents.


The purpose of training the predictive coding system in
this scenario is for human reviewers to teach the predictive
coding tool to identify the difference between responsive 
and nonresponsive documents. The system studies the docu- 
ment coding decisions in the training set to learn how human 
reviewers distinguish responsive from nonresponsive docu- 
ments. Next, the system leverages the knowledge shared by 
the human reviewers to generate a computer model that is 
used to assign a prediction score to each document based 
on degree of responsiveness. Finally, the prediction score 
and other features can be used to analyze, rank, and review 
all the case documents quickly and efficiently. This process is 
explained more fully in the remaining steps. 


Step 5: Testing the system 


After the initial round of training is complete (the training set 
has been reviewed), you can test the system. The purpose of
testing the system is to measure the predictive coding tool’s
performance.


The test begins by directing the predictive coding system to 
make predictions about the responsiveness of the documents 
contained in the control set. The predictive coding system’s 
predictions are then compared to the coding decisions
made by the human reviewers on the same set of documents.
This comparison allows the performance of the predictive 
coding tool to be measured using recall, precision, and 
f-measure calculations. 
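

The comparison can be sketched in a few lines of Python. This is an illustration rather than a description of how any particular tool works. It assumes each document in the control set has a human coding decision and a system prediction, both recorded as True (responsive) or False (nonresponsive), and it assumes the f-measure is the balanced harmonic mean of recall and precision. The function name review_metrics is made up for this example.

```python
def review_metrics(human_codes, predicted_codes):
    """Compare predictions with human coding on the same control set.

    Both arguments are equal-length lists of booleans, where True means
    "responsive". Returns (recall, precision, f_measure).
    """
    if len(human_codes) != len(predicted_codes):
        raise ValueError("both lists must cover the same documents")

    pairs = list(zip(human_codes, predicted_codes))
    # Responsive documents the system found.
    true_pos = sum(1 for human, predicted in pairs if human and predicted)
    # Nonresponsive documents the system wrongly flagged.
    false_pos = sum(1 for human, predicted in pairs if not human and predicted)
    # Responsive documents the system missed.
    false_neg = sum(1 for human, predicted in pairs if human and not predicted)

    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    if precision + recall == 0:
        f_measure = 0.0
    else:
        # Assumed definition: balanced harmonic mean of precision and recall.
        f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure
```

For example, if reviewers coded four of eight control set documents as responsive, and the system found three of those four while wrongly flagging one nonresponsive document, recall and precision would both be 75 percent.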


If the desired performance levels are not achieved, additional
training documents must be selected and coded, and the
system must be retrained and retested.
This iterative process typically involves repeating Steps 4 and 5 
until the desired performance levels are achieved. Importantly, 
newer generation predictive coding tools possess active
learning functionality that can automatically select the
next training set to be reviewed by human reviewers, while
early-generation tools may require these documents to always 
be selected manually. 
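

The loop of repeating Steps 4 and 5 can be sketched as follows. This is a simplified illustration: it uses recall as the only performance target, and train_round and measure_recall are hypothetical placeholders for whatever a given tool does to add a training set and to score the control set.

```python
def train_until_target(train_round, measure_recall, target_recall, max_rounds=10):
    """Repeat Steps 4 and 5 until the control set shows the target recall.

    train_round(round_number) is a hypothetical callable that reviews one
    more training set and retrains the model. measure_recall() is a
    hypothetical callable that scores the control set and returns recall.
    Returns (rounds_used, final_recall).
    """
    recall = 0.0
    for round_number in range(1, max_rounds + 1):
        train_round(round_number)   # Step 4: select and code a training set
        recall = measure_recall()   # Step 5: test against the control set
        if recall >= target_recall:
            return round_number, recall
    # The target was not reached within the allowed number of rounds.
    return max_rounds, recall
```

The max_rounds limit is a design choice for the sketch. It keeps the loop from running indefinitely when the target cannot be reached with the available training documents.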


Early tools may require a workflow whereby the predictive
coding model is tested against all the remaining documents in 
the population instead of only testing against the documents 
in the control set. This approach requires significantly more 
processing time because the predictive coding model is 
repeatedly tested against the entire document population, 
which consists of significantly more documents than the 
control set. 


Step 6: Applying predictions 


After the desired performance levels are achieved, you can 
apply the predictive coding model to the remaining documents 
in the population. The purpose of this step is to leverage the 
predictive coding tool’s ability to assign prediction scores to 
every document. 


Chapter 4 discusses the benefits of completing these steps, 
and Chapter 5 explains different approaches for managing the 
final production of documents after Step 6. 


Selecting a reliable and trustworthy predictive coding 
provider who can articulate and understand proper predictive 
coding workflows is as important as the tool selected. If a 
particular predictive coding tool is easy to use, it is still 
important to make sure the underlying calculations the 
system makes behind the scenes to obtain results are also 
accurate and defensible.


Chapter 4
Predictive Coding Benefits 


In This Chapter
Comparing more traditional document review approaches
Realizing the benefits


Despite the promise of predictive coding, it is relatively
new to the legal profession and many legal teams are 
wedded to more traditional approaches. This chapter compares 
predictive coding to more traditional approaches and discusses 
the benefits of predictive coding technology. 


Understanding the Traditional 
Approach to Document Review 


Keyword search tools are commonly used to segregate 
responsive from nonresponsive documents in order to 
respond to document requests during discovery. The tools 
allow users to type a word or phrase into a system that 
retrieves all the documents containing that word or phrase. 
Once potentially responsive documents are identified, they are 
manually reviewed by legal teams to verify responsiveness. 
The search and review process also normally includes 
identification and removal of privileged documents before 
the remaining responsive documents are produced to the 
requesting party. 


Chapter 1 explains the steep costs associated with paying 
teams of lawyers to manually review large volumes of 
documents and the belief that predictive coding tools will 
help reduce these costs. However, proponents of the 
traditional approach argue that while predictive coding may
be less expensive, it is also less reliable. Others, including
the legal think tank known as The Sedona Conference,
argue that any belief that keyword search is superior to
automated review methods — such as predictive coding 
technology — is a myth (see “The Sedona Conference Best 
Practices Commentary on the Use of Search and Information 
Retrieval Methods in E-Discovery” for additional information, 
https://thesedonaconference.org/publications).


The problem with keyword searches is that knowing all the 
potentially responsive keywords within a large group of docu- 
ments is impossible. This requires users to make educated 
guesses about which keywords to include in searches. As a 
result, the ESI retrieved is often either under- or over-inclusive. 
If the quantity of documents retrieved is over-inclusive, then 
more time and money are spent segregating responsive from 
nonresponsive documents prior to production. If the quantity 
of documents is under-inclusive, then documents that should 
have been produced to the requesting party may be overlooked, 
possibly resulting in sanctions (see Chapter 1). 


Few dispute the potential for significant cost savings with 
predictive coding, but a key issue of debate is whether the 
technology performs as well as human reviewers. The issue is 
critical because parties must use reasonable efforts to respond 
to document requests. If a producing party can’t demonstrate 
that their predictive coding approach is as thorough as the 
traditional approach, use of the predictive coding technology 
may be challenged as unreasonable. Not surprisingly, the
requirement that the document production process be
reasonable and fair to both parties has sparked considerable
debate about whether predictive coding is as accurate as, or
more accurate than, keyword searching and manual review.


A commonly referenced law review article authored in
2011 addresses the issue by arguing that predictive coding
technology not only can be more accurate, it can be less
expensive and time consuming than traditional manual
review. In “Technology-Assisted Review in E-Discovery
Can Be More Effective and More Efficient Than Exhaustive
Manual Review,” Maura R. Grossman and Gordon V. Cormack 
compare results of exhaustive manual document review to 
technology-assisted review (TAR) methods. They conclude 
that TAR methods such as predictive coding can (and do) 
yield more accurate results than exhaustive manual review, 
with less effort. Although the article focuses primarily on the
accuracy of human review compared to TAR, it also provides
general support for some of the benefits described here. 


Understanding the Benefits 
of Predictive Coding 


Predictive coding has slowly gained momentum in the legal 
community because many people believe the technology can 
be more accurate than traditional review methodologies while 
simultaneously reducing review time and costs. Some of the 
key benefits of predictive coding technology are described here. 


Reduced cost and time 


The main reason predictive coding technology costs less and 
takes less time is that the technology requires fewer documents 
to be reviewed by humans. Instead of requiring humans to 
painstakingly review each document for responsiveness, the 
technology relies on human input to help prioritize important 
documents for review and eliminate the need to review other 
documents altogether. Review costs can be substantially 
decreased if the predictive coding software costs are less 
than the costs of manual review. A general rule to remember 
is that the fewer documents requiring manual review and the 
lower the cost of using predictive coding software, the more 
money saved. The cost of using any predictive coding tool is 
a key factor that should always be considered. See Chapter 7 
for more information on choosing the right predictive coding 
technology. 
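

The general rule can be written as simple arithmetic. The sketch below is only an illustration: the function name and every figure in the example that follows are hypothetical, and real pricing models vary.

```python
def estimated_savings(total_docs, docs_reviewed, cost_per_doc, software_cost):
    """Estimate savings from predictive coding versus full manual review.

    total_docs:    documents in the population
    docs_reviewed: documents still reviewed by humans with predictive coding
    cost_per_doc:  assumed manual review cost per document
    software_cost: assumed total cost of using the predictive coding tool
    A negative result means the tool cost more than it saved.
    """
    manual_cost = total_docs * cost_per_doc
    predictive_cost = docs_reviewed * cost_per_doc + software_cost
    return manual_cost - predictive_cost
```

Suppose, hypothetically, that a case involves 1,000,000 documents, manual review costs $1 per document, humans still review 300,000 documents, and the software costs $100,000. The estimated savings would be $600,000. If the software cost rose to $800,000, the same review would cost more than a full manual review.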


Strategic negotiations 


The ability to rank a large group of documents by estimated 
degree of responsiveness is also a valuable method for 
reducing costs and time. Using the example from Chapter 2, in 
a situation in which only documents containing a prediction 
score in the 70 percent range and higher appear responsive, a 
legitimate argument can be made that only the top 30 percent 
of documents should be produced. Reviewing and producing 
only the top 30 percent of documents in lieu of reviewing all 
the documents could result in significant time and cost 
savings for the producing party.


The prioritization and ranking of documents can be used to
eliminate the need to manually review documents with the 
lowest rankings (see Chapter 5). The more documents that 
can be eliminated without requiring human review, the faster 
the review process and the more money saved. 


Many organizations conduct a final manual review of the 
documents before producing them (see Chapter 5). The more 
trustworthy the tool and process, the less risk associated with 
producing the documents without conducting a final review. 


If opposing parties demand production of more documents, 
the judge should consider the proportionality of the request. 
If only a small percentage of documents falling below the
30 percent threshold are likely to be responsive, the judge
may consider shifting the costs of additional review to the 
requesting party or denying the requesting party’s demand 
for additional documents altogether. 


Early case assessment 


Ranking documents by responsiveness also helps you find 
important documents quickly without requiring every docu- 
ment to be manually reviewed. The ability to identify the most 
important documents without first spending significant time 
and money sorting through other less important documents 
enables attorneys to assess the strength of their cases earlier. 


If key documents reveal a weak position, settling the case 
may be preferable to going to trial. On the other hand, if key 
documents are strong, then you gain leverage to help secure 
a better outcome through settlement negotiations or at trial. 
The ability to assess case strength early by ranking docu- 
ments with predictive coding tools saves time and money. 


Increased accuracy 
and reduced risk 


Since computers don’t get tired or daydream, many believe
predictive coding technology can determine document 
responsiveness better than humans. Accuracy is important 
because the risk of overlooking important documents could 
have severe consequences (see Chapter 1). Importantly, 
regardless of the type of tool used, it must be used properly 
to avoid increasing the risk of error.


Chapter 5


Predictive Coding 
Approaches 


In This Chapter 
Taking a look at the basic steps 
Exploring possible document production approaches 


Chapter 1 explains that eDiscovery is a legal term that
applies to parties involved in litigation. However, the 
term eDiscovery is also commonly used more broadly to refer 
to a wide range of situations requiring the identification and 
production of documents. Although predictive coding technology 
can be used to help parties respond to document requests in 
many situations, there are different approaches for handling the 
final production of documents. This chapter examines some of 
those approaches and explains some of the situations or use 
cases in which certain approaches may be favored over others. 


Understanding the Basic Steps 


There are multiple techniques for using predictive coding to 
streamline the production of documents to requesting parties, 
and each approach involves a number of steps. The initial 
steps listed here are normally followed regardless of which 
workflow is selected. (See Chapter 3 for a more detailed 
predictive coding workflow description.) 


Predict responsiveness 


After the predictive coding system has been trained and the 
desired accuracy levels are achieved, you can use the tool to 
predict the responsiveness of every document.


Leverage prediction scores


Once predictions are applied to all the documents, each docu- 
ment is assigned a prediction score, expressed as a percentage, 
indicating the likelihood of responsiveness. This information is 
invaluable because it enables you to rank documents in order 
of importance. For example, if you prefer to prioritize your 
review by analyzing the most important documents first, then 
you could easily navigate to the documents ranked within the 
top 10 percent based on degree of responsiveness. 
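

Ranking by prediction score can be sketched as follows. The example assumes the scores have been exported from the tool as a mapping of document ID to score; the function name and the document IDs are hypothetical.

```python
def top_percent(scores, percent):
    """Return the IDs of the documents ranked in the top `percent` by score.

    scores maps a document ID to its prediction score (0 to 100).
    percent is a whole number, such as 10 for the top 10 percent.
    """
    # Highest prediction scores first.
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Integer arithmetic that rounds up to a whole document.
    keep = (len(ranked) * percent + 99) // 100
    return ranked[:keep]
```

With ten documents scored 10 through 100, asking for the top 10 percent returns only the document scored 100, and asking for the top 30 percent returns the three highest-scoring documents.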


Similarly, prediction scores are extremely important because 
they also enable you to determine and even negotiate what per- 
centage of documents will be produced to the requesting party. 
As explained in Chapters 2 and 4, a legal team could use the 
prediction scoring system to decide that only documents con- 
taining scores above a certain threshold will be produced. The 
team could defend their decision to only produce documents 
above a certain threshold by illustrating that few, if any, docu- 
ments falling below that threshold are likely to be responsive. 
This approach saves time and money because many if not most 
of the remaining documents may no longer require review. 


Importantly, some predictive coding systems have built-in 
transparency features to help explain the rationale behind 
every document’s prediction score. Among other things, these 
transparency features include links between related documents 
and a summary of important words and phrases contained in 
each document that helped determine its responsiveness. 


Privilege screening 


If you have concerns about inadvertently producing privileged 
ESI, you should always screen the documents designated for 
production for privileged information. This step essentially 
repeats the same privilege screening process that should 
occur prior to beginning the predictive coding process (refer 
to Chapter 3). For example, you can use TAR tools to conduct 
a final search for attorney and law firm names contained in the 
remaining documents as one method to catch any privileged 
documents that may have slipped through the cracks. 
Additionally, you can create a new prediction model to help 
identify privileged documents as a final quality control measure. 
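

A final name search of this kind can be sketched in Python. This is a minimal illustration, not a substitute for a TAR tool or for attorney judgment. It assumes the documents are available as plain text, and the function name and the attorney and firm names in the example that follows are hypothetical.

```python
import re


def flag_possible_privilege(documents, names):
    """Return IDs of documents that mention any listed attorney or firm name.

    documents maps a document ID to its text. names is a list of attorney
    and law firm names. Matching ignores case and requires the whole name,
    so "Roe" does not match "Roebuck".
    """
    if not names:
        return []
    alternatives = "|".join(re.escape(name) for name in names)
    # The lookarounds stop a name from matching inside a longer word.
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)
    return [doc_id for doc_id, text in documents.items() if pattern.search(text)]
```

For instance, searching for "Jane Roe" would flag an e-mail that reads "jane roe approved the draft" but not one that reads "Quarterly sales figures attached." Flagged documents still need human review, because a name match alone does not establish privilege.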


Handling Final Document
Productions 


Although following the basic workflow steps described in the 
previous section makes sense, selecting the proper approach 
for the final production of documents is an individual decision 
that must be made on a case-by-case basis. Some of the factors 
that should be balanced when deciding on a final document 
production approach include 


The potential for cost savings
The degree of risk involved
Quality of the predictive coding tool used
Document production deadlines
Value of the case


Different organizations faced with the same set of circum- 
stances might use different approaches depending on how 
they balance the factors. This section explains a few different 
approaches for handling the final production of documents 
and provides guidance to help evaluate the decision. The 
approaches described here assume at least some, if not all, of 
the steps described in the previous section already occurred. 
The approaches discussed also focus primarily on balancing 
the desire for cost savings with the desire to avoid producing 
privileged documents. 


Produce without review 


Producing documents designated for production following the 
final privilege screen is the most cost-effective approach for 
managing the final production of documents. This approach 
doesn’t include spot checking or any additional manual 
review prior to production. Instead, the approach relies on 
the experience and expertise of those tasked with training and 
managing the predictive coding system and process, as well 
as the quality of the predictive coding tool used. 


Predictive coding technology is designed to exceed the accuracy 
of human review. In other words, predictive coding technology 
should reduce the risk of producing privileged documents if the
system performs accurately and is used properly (see Chapter 2,
which compares the accuracy of human and technology-assisted 
review). Regardless, some are reluctant to produce documents 
without first conducting a final manual review. The remaining 
approaches described next can be used by those who wish to 
introduce added safeguards into the process. 


Not all predictive coding tools are created equal. That means
your level of comfort using a particular approach depends in
large part on the quality of the tool selected. Other factors to
evaluate include confidence that the tool and process are properly
administered, the prevalence and significance of privileged 
documents in the original document population, the additional 
cost of spot checking or manual review, and case deadlines. 


Produce after spot checking 


Randomly sampling or “spot checking” documents designated 
for production is another cost-effective approach for manag- 
ing the final production of documents. This approach typi- 
cally involves randomly sampling and selecting documents 
designated for production and reviewing them to check for 
privileged documents. If privileged documents are found, they 
can be set aside prior to production. Alternatively, a determi- 
nation could be made that further training of the predictive 
coding system is required prior to production. 


Documents not designated for production could also be ran- 
domly sampled to determine whether documents that should 
be produced are properly designated. 
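

Spot checking of either group can be sketched as drawing a simple random sample and tallying what the reviewers find. The function names below are made up for this example, and the sketch says nothing about how large the sample should be.

```python
import random


def spot_check_sample(doc_ids, sample_size, seed=None):
    """Draw a simple random sample of documents for manual spot checking.

    Passing a seed makes the draw repeatable, which can help when the
    sampling process later has to be explained or defended.
    """
    rng = random.Random(seed)
    doc_ids = list(doc_ids)
    return rng.sample(doc_ids, min(sample_size, len(doc_ids)))


def estimated_miss_rate(sample_codes):
    """Share of sampled, non-produced documents that reviewers coded responsive.

    sample_codes is a list of booleans, where True means a reviewer found
    the sampled document responsive.
    """
    if not sample_codes:
        return 0.0
    return sum(1 for responsive in sample_codes if responsive) / len(sample_codes)
```

If reviewers find one responsive document in a sample of 50 drawn from the documents not designated for production, the estimated miss rate is 2 percent. Whether that rate is acceptable, or whether further training is required, remains a judgment call for the legal team.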


Prioritize and produce 
after manual review 


Using the predictive coding system to prioritize documents 
for manual review is the least cost-effective approach for man- 
aging the final production of documents. This approach goes 
beyond mere spot checking and includes the manual review 
of most if not every document prior to production. If you have 
concerns about the technology used or the process followed, 
this approach may be right for you. However, investing in 
trustworthy tools and technology providers in the first place 
is a more cost-effective option that can minimize the need to
follow this approach.


Chapter 6


Early Predictive Coding 
Challenges 


In This Chapter 
Looking at first-generation tools 
Understanding risk and defensibility issues 


Given the promise of faster and less expensive document
reviews combined with higher accuracy rates, many 
people don’t understand why predictive coding technology 
hasn’t experienced wider adoption. This chapter explains 
some of the perceived risks associated with early-generation 
predictive coding tools that have led some to take a wait-and- 
see approach. This chapter also explores how risks related to 
using complex technology are closely related to legal defensi- 
bility and provides guidance on how to minimize those risks. 


First-Generation Predictive 
Coding Technology 


Predictive coding tools apply a complicated new technological 
approach to a document review process that has tradition- 
ally been very simple. The new process typically involves a 
series of steps described in Chapter 3 that includes sampling, 
training, testing, and measuring results in order to fine-tune an 
algorithm that is used to help predict the responsiveness of 
the remaining documents. Some believe asking attorneys to 
use predictive coding technology instead of more traditional 
methodologies is like asking them to fly jet airplanes when 
they are more accustomed to driving cars.


Although the underlying machine-learning technology behind
predictive coding is nothing new, predictive coding is new to 
the legal field. Introducing a new technological approach to 
document review in an era in which many attorneys rely 
heavily on keyword searching and manual review (see 
Chapter 4) presents challenges. Perhaps the biggest challenges 
lie in the fact that early-generation technology tools can be 
difficult to use. The issue is certainly not unique to predictive 
coding software. The improvement and evolution of technology 
solutions is common in the world of technology. For example, 
Apple regularly releases new versions of products like the 
iPad and iPhone to improve product performance. Similarly, 
Microsoft has released multiple versions of Internet Explorer, 
Exchange, and essentially all of its active products over the 
years to address bugs and enhance their solutions. 


Similarly, predictive coding technology is still in its infancy 
with respect to eDiscovery. That means the tools must 
continue to evolve and become more transparent for end 
users so they are easier to use. That does not mean predictive 
coding tools should be abandoned in favor of more traditional 
approaches. However, it does mean predictive coding tools 
should be selected and used cautiously as the technology and 
knowledge about proper use of the technology evolves. While 
product evolution continues, it is important to identify 
predictive coding tools that are understandable, relatively 
simple to use, and integrated within a broader eDiscovery 
software platform. 


Aligning with a competent solution provider is equally 
important. First, the provider must be able to train and 
support users on all aspects of the product. Second, because 
predictive coding technology is evolving, it is extremely 
important to believe in the provider’s long-term product 
vision. Finally, any provider must be able to explain the 
statistical methodology behind their technology to ensure 
accuracy levels represented to the court and opposing parties 
are always valid. If the technology provider cannot provide 
these types of assurances, consider looking elsewhere. 


The following sections provide more detail about how to 
avoid some of the risks associated with early-generation 
predictive coding tools.


Difficult to use


A key problem with early predictive coding tools is that 
implementing a proper workflow can be confusing and 
complex. This complexity normally exists because the tools 
are new to the legal field so complex steps related to sampling 
and measuring system performance are not fully automated 
and are sometimes misunderstood. Since early-generation 
tools do not automate many of the complex steps in the 
process, users commonly face a series of difficult workflow 
decisions that aren’t always intuitive. A single misstep along 
the way could lead to flawed document productions and 
problems defending the reasonableness of the process. 


Surprisingly, even though many early-generation tools are
not intuitive, instructions and training regarding defensible
workflows are often lacking. Conversely, although some tools
are easier to use, the rationale behind the tool’s methodology
for identifying responsive documents and measuring the
system’s performance is often not transparent.


Transparency and the black box 


Many predictive coding tools do not provide visibility into 
how important decisions are made by the computer. This lack 
of transparency has led some to characterize early tools as 
black box technologies. 


A common black box problem is the lack of visibility into why 
the predictive coding tool designates some documents as 
responsive and not others. As predictive coding technology 
evolves, look for advanced reporting tools to improve 
transparency into these kinds of decisions for improved 
defensibility. For example, if questions surface about why
a particular document is designated as responsive, users
should be able to view and analyze all related documents that 
helped form the basis for the system’s decision. 


Similarly, predictive coding providers should be able to 
illustrate the underlying methodology and statistical 
calculations behind their products. Basic formulas used to 
sample, test, and measure system performance should be 
well supported within the broader academic community. If
a provider’s methodology is supported within the academic
community and properly applied, the need to hire experts in 
order to defend the use of a particular tool can be minimized.


Defensibility Concerns


Predictive coding tools offer a new approach to document 
review that deviates from the status quo. Not surprisingly, 
early-generation tools have been heavily scrutinized. The 
following sections address common concerns and challenges 
related to the technology’s early use. 


Waiver and defensibility 


Perhaps the biggest concern with early predictive coding tech- 
nology is the risk of privilege waiver and concerns about defen- 
sibility. Since many early predictive coding tools are not yet fully 
automated and transparent, they can be difficult for the average 
attorney to understand and use effectively. This fact highlights 
the importance of using predictive coding tools that are reliable, 
easy to use, and defensible. Not only are complex technologies 
and workflows difficult to defend if they are hard to understand, 
this complexity can also increase the risk of error. 


Increasing the risk of human error by using complex predictive 
coding tools means that the chance of overlooking important 
documents or inadvertently producing privileged documents 
is also increased. If important documents that should be 
produced are not produced, the producing party could face a 
wide variety of sanctions. 


On the other hand, inadvertently producing sensitive or 
privileged documents could also have negative consequences. 
For example, a privileged document may reveal trade secrets 
that could be damaging to the organization. Similarly, privileged
communications, such as an e-mail between an attorney and
client, may no longer be protected from disclosure (that is, the
privilege may be waived) if they are inadvertently disclosed to
a third party.


See Chapter 1 for a definition of sanctions and Chapter 3 for 
more information about reducing the risk of privilege waiver. 


Judicial guidance 


Before 2012, there was little judicial guidance regarding the
use of predictive coding technology, and many expressed
concerns about whether judges would deem predictive coding
technology acceptable as an alternative to more traditional
eDiscovery approaches. Beginning in early 2012, courts started
to address that gap.


Resistance to change in a profession where keyword search 
and manual document review have long been considered the 
eDiscovery gold standard is not surprising. What is surprising 
is that some of the early cases that many expected to open 
the door to widespread adoption of predictive coding fueled 
further resistance among some to using the technology. 
Although the value of using predictive coding technology to 
improve document review is highlighted in these cases, the 
cases also illustrate that early-generation predictive coding 
technology can be complex and difficult to use. The following 
sections provide a brief summary of three early cases involv- 
ing predictive coding technology. 


Although evaluating ease of use and the underlying methodology 
behind various predictive coding alternatives is important, 
case law illustrates that establishing the proper workflow for 
using any tool is equally critical. 


Da Silva Moore v. Publicis Groupe 


In Da Silva Moore v. Publicis Groupe (2012), Magistrate Judge 
Andrew Peck of the U.S. District Court for the Southern 
District of New York issued the first-known court order 
endorsing the use of predictive coding technology “in 
appropriate cases.” In Da Silva Moore, the parties agreed to 
use predictive coding technology, but continued to disagree 
on the proper protocol or process. Plaintiffs argued that the 
court should not have adopted the defendant’s recommended 
protocol over their objections. On the other hand, defen- 
dants argued that the protocol is exceedingly fair to plaintiffs 
considering, among other things, that plaintiffs were granted 
permission to dispute the defendant’s coding decisions with 
respect to nonresponsive documents. 


Plaintiffs continue to claim that the existing protocol lacks 
adequate sampling techniques and will result in responsive 
documents being overlooked that defendants are obligated
to produce. Plaintiffs’ initial motions to overturn the order
outlining the predictive coding protocol and to have Judge 
Peck removed from the case were both denied. As of publication 
time, the parties remain at odds and have been involved in 
skirmishes about which documents should and should not be 
characterized as responsive.


Updates and blogs regarding Da Silva Moore and other
important eDiscovery matters are routinely provided on 
www.clearwellsystems.com/e-discovery-blog/. 


Kleen Products, LLC v. Packaging 
Corporation of America 


Also in early 2012, Magistrate Judge Nan Nolan of the U.S. 
District Court for the Northern District of Illinois began 
tackling the issue of predictive coding technology in Kleen 
Products, LLC v. Packaging Corporation of America (2012). In 
Kleen, plaintiffs basically asked Judge Nolan to order defendants 
to redo their production even though at least one of the 
defendants spent thousands of hours reviewing documents, 
produced over a million documents, and completed a 
significant part of its document review. The parties presented 
witness testimony in support of their respective positions for 
two full days, and more testimony may be required before the 
eDiscovery issues are resolved. At publication time, it is not 
clear whether the defendants will be required to use predic- 
tive coding technology in order to supplement the more tradi- 
tional eDiscovery approach they have already followed. 


Global Aerospace Inc. v. Landow Aviation, L.L.P.


Finally, on April 23, 2012, Virginia Circuit Court Judge James H. 
Chamblin issued what appears to be the first state court order 
approving the use of predictive coding technology for eDiscovery. 
In Global Aerospace Inc. v. Landow Aviation, L.L.P. (2012), 
defendants sought an order allowing them to use predictive 
coding technology after opposing counsel objected to their 
proposed use of the technology to “retrieve potentially
relevant documents from a massive collection of electronically
stored information.” Importantly, Judge Chamblin issued the 
order “without prejudice to a receiving party,” which would 
allow the plaintiff to challenge the use of predictive coding or at 
least object to the “completeness of the production.” So far, it 
doesn’t appear that the parties have run into significant issues. 


Case law regarding the use of technology tools is likely to 
evolve. Expect future cases to further validate the fact that 
using eDiscovery technology properly is as important as 
selecting the right technology.


Chapter 7


Choosing the Right 
Predictive Coding Software 


In This Chapter 


Researching product offerings 
Knowing the right questions to ask 


The promising future of predictive coding technology has
resulted in a new market opportunity for many in the 
eDiscovery business. This has many companies racing to 
market with new technology offerings in order to capitalize 
on the opportunity. Some of these new tools lack quality 
and sophistication. Other tools are sound, but company 
representatives may not know how to use the tools properly 
and could end up misinforming customers and prospects. 


Further complicating the ability to choose the right solution is 
the fact that many companies license third-party technology 
they brand as their own and/or they rely on partners to sell 
their products and services for them. In both situations, 
people selling and supporting these solutions may have 
limited first-hand knowledge about how the technology works. 


All of these market dynamics add to the spread of misinformation
within the industry, which creates confusion about product
capabilities and the best methodologies for using these 
solutions. Unfortunately, making informed decisions in the 
midst of this confusion is difficult for consumers attempting 
to select a solution. On the other hand, working with people 
and organizations you trust can help minimize this pain. 


This chapter provides important guidelines to make evaluating 
the different predictive coding tools available on the market
easier. Following these guidelines will empower you to ask the
right questions so you can identify the right predictive coding 
solution for your organization. 


Doing Your Research


The growth of the eDiscovery market and the promising 
future of predictive coding technology have resulted in a
wide variety of companies introducing new predictive coding 
solutions to the marketplace. Although variety is good, 
understanding the differences between these solutions and 
their providers can be challenging. 


Industry analyst Gartner Inc. estimates that the enterprise 
e-Discovery software market reached $1 billion in total 
software vendor revenue in 2010. The five-year CAGR 
(Compound Annual Growth Rate) is approximately 16 percent. 
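

To get a feel for what a 16 percent CAGR means, the short Python sketch below compounds the $1 billion figure forward. It rests on an assumption the text does not state: that 2010 is the base year and the growth applies to the five years that follow. The function names are illustrative only.

```python
def project_revenue(base_revenue: float, cagr: float, years: int) -> float:
    """Compound base_revenue forward at a constant annual growth rate."""
    return base_revenue * (1.0 + cagr) ** years


def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Recover the compound annual growth rate between two values."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


if __name__ == "__main__":
    base = 1.0  # billions of dollars in 2010 (assumed base year)
    for offset in range(6):
        # Print the assumed year and the projected revenue in billions.
        print(2010 + offset, round(project_revenue(base, 0.16, offset), 2))
```

Under that assumption, five years of 16 percent growth roughly doubles the market, to about $2.1 billion.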


Not all technology solutions are created equal, so properly
vetting solutions before using them is critical. The following 
guidelines will help you ask intelligent questions when 
evaluating which predictive coding software solution is right 
for your organization. 


Most organizations make eDiscovery purchasing decisions 
based on the desire for a comprehensive eDiscovery platform. 
A comprehensive platform typically includes modules for 
administering legal hold notices, collecting ESI from multiple 
sources, processing, culling, and analyzing that ESI, and then 
reviewing and producing the remaining ESI. Predictive coding 
is merely one of the many important tools that should be 
included within a comprehensive eDiscovery platform. 


Placing too much emphasis on any one of the following steps 
is a mistake. Most, if not all, of these steps should be 
considered when evaluating various solutions, but following 
these steps in the order listed is not required. 


Consult independent reports 


Hundreds of eDiscovery providers are clamoring for your
business, giving you many different technology solutions
to choose from. The bad news is that these choices make
comparing solutions and the companies behind them difficult.
The good news is that independent analysts like Gartner Inc. 
perform in-depth market research annually so that you don’t 
have to start from scratch. Whether you are new to predictive 
coding and eDiscovery or a seasoned veteran, take advantage 
of independent reports such as Gartner’s “2012 Magic 
Quadrant for eDiscovery Software” to help you identify the 
industry leaders and understand the breadth and quality of 
their eDiscovery offerings. 


Request a formal proposal 


Many organizations have a formal procurement process for
making purchasing decisions that requires the technology
providers under consideration to submit a written proposal.
Written proposals are often a good way to obtain basic
information about how different solutions are priced and
supported and how they perform. The value of the request for proposal
(RFP) process depends on how the questions are asked and 
how the procurement process is administered. Although RFPs 
can be valuable tools for gathering information about different 
solutions, RFP questions and responses are often poorly 
worded and confusing. For this reason, making decisions about 
which solution to include or exclude from the procurement 
process, based exclusively on RFP responses, can be risky. 
Some, if not all, of the following steps should be incorporated 
into the RFP and broader procurement process to minimize 
this risk. 


Seek customer references 


Although reading and understanding what independent 
analysts say about predictive coding software providers and 
the companies behind them is important, analysts typically 
don’t have the luxury of actually using these tools regularly 
in practice. That’s why speaking to customers who use the 
eDiscovery solutions you’re considering is a critical step that 
you should follow prior to making an investment. Basic 
questions to ask may include the following: 


Does the solution perform as advertised?


Is product training comprehensive?


How is technical support?


Was the provider honest during the sales process?


Is predictive coding one part of an integrated eDiscovery 
platform or are multiple solutions required? 


How long did it take to implement the solution? 


Although many solution providers are prepared to provide a
list of happy customer references upon request, you should
consider digging deeper. Independently
talk to your peers in the industry rather than relying on 
customer references supplied by the provider you are 
considering. Make sure to always consider the accuracy of 
the information provided and try to talk to multiple sources 
rather than relying too heavily on one particular reference. 


Conduct demonstrations and POCs 


The best way to understand and evaluate how a product 
works is through product demonstrations and discussions. 
Product demonstrations not only represent an opportunity 
to learn how products stack up against each other, they also 
present a good opportunity to ask providers tough questions. 
On the other hand, even if a product demonstration
is impressive and company representatives dazzle you
with their industry knowledge and a long list of product
differentiators, don’t fall in love too early.


A critical step in vetting any enterprise software solution 
(software that is deployed internally within an organization’s 
information technology infrastructure) is to put the software 
through rigorous internal testing commonly called a proof of 
concept (POC). The POC should entail connecting the 
eDiscovery software to the company network to evaluate 
how the solution performs within the company’s unique 
information technology environment on company data. 
Providers reluctant to perform this step or that suggest 
alternative approaches may not have confidence that the 
solution can perform as well in a “live” environment as was 
represented during the product demonstration. 


Testing the performance of an eDiscovery technology 
platform on-site in a live environment through a proof of 
concept is important because product shortcomings become 
more difficult to hide. However, in some situations, testing a
particular module of an eDiscovery solution may not require
connecting the solution to the organization’s network. Make 
sure to involve resources from your organization’s information 
technology department to evaluate when testing the solution 
within the network should occur. 


Asking the Right Questions 


As described earlier, when evaluating various predictive 
coding solutions, it is important to consider the entire 
eDiscovery platform. However, you should still ask specific 
questions about each provider’s predictive coding tool that 
can help with the evaluation process. Although not an 
exhaustive list, the following questions are helpful: 


Are all critical modules, such as legal hold notice, ESI
collection, processing/culling, and review, integrated
within the same eDiscovery platform?


Is the eDiscovery platform made up of different
technologies acquired through acquisition or licensed 
from third parties or was the entire platform developed 
by a single provider? 


Is the solution provider financially stable and able to 
support industry growth? 


What is the solution provider’s long-term product
development vision and business plan? 


What is the pricing structure? Is there an additional cost 
to use predictive coding? 


Is the solution truly a predictive coding solution built on 
machine learning technology or is the solution really a 
form of concept searching or clustering technology? 


What is the underlying statistical methodology used by
the system and is that methodology generally accepted 
within the academic community? 


Is the process for selecting document samples and
measuring system accuracy automated and defensible? 


Can the system generate reports so proportionality 
arguments such as cost-shifting can be made before 
document review begins?


What happens when additional ESI must be added to
an existing predictive coding project that has already 
begun? 


Is it possible to link documents together to understand 
the basis for individual coding decisions made by the 
system? 


How fast can ESI be processed generally and how does 
the solution perform on cases containing large data 
volumes? 


Can training sets be selected actively and automatically 
using computer intelligence or must they always be 
selected randomly? 


Can predictive coding intelligence from one matter be 
applied to similar matters with little effort for greater 
efficiency? 
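

When you ask how a provider measures system accuracy, it helps to know what the numbers mean. The Python sketch below is one common way to score a system against attorney coding decisions on a sample of documents, using precision, recall, and F-measure. It is a generic illustration, not any provider's actual method, and the function and variable names are hypothetical.

```python
from typing import Dict, Sequence


def accuracy_metrics(human: Sequence[bool], system: Sequence[bool]) -> Dict[str, float]:
    """Compare system predictions with attorney coding on the same documents.

    True means "responsive" and False means "nonresponsive".
    """
    if len(human) != len(system):
        raise ValueError("Both sequences must cover the same documents.")

    # Count agreements and disagreements on responsive documents.
    true_pos = sum(1 for h, s in zip(human, system) if h and s)
    false_pos = sum(1 for h, s in zip(human, system) if not h and s)
    false_neg = sum(1 for h, s in zip(human, system) if h and not s)

    # Precision: share of documents the system flagged that are responsive.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    # Recall: share of responsive documents that the system found.
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    # F-measure: harmonic mean of precision and recall.
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0

    return {"precision": precision, "recall": recall, "f_measure": f_measure}


if __name__ == "__main__":
    attorney = [True, True, True, False, False, False, False, False]
    predicted = [True, True, False, True, False, False, False, False]
    print(accuracy_metrics(attorney, predicted))
```

A provider whose reports expose these counts makes it easier to explain, and defend, how accuracy was measured.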


Chapter 8


Ten Important Things about 
Predictive Coding 


In This Chapter 
Taking advantage of predictive coding tips 
Understanding key technology differentiators 


Throughout this book, I repeat the fact that predictive
coding technology introduces a promising new approach 
to eDiscovery. However, since the use of these tools also 
introduces a new level of complexity, selecting the right tool 
and using that tool properly is critical. 


This chapter provides tips about critical issues you need to 
understand in order to take advantage of the many benefits of 
predictive coding technology without introducing undue risk 
into your eDiscovery process. 


Perfection Is Not Required 
in eDiscovery 


Regardless of the tools or techniques utilized to respond to
document requests in eDiscovery, perfection is not required.
The goal should be to create a reasonable and repeatable
process to establish defensibility in the event you face challenges
by the court or an opposing party. Make sure the predictive
coding tool and broader eDiscovery platform you choose
function correctly, are used properly, and can generate reports
illustrating that a reasonable process was followed. Remember,
making smart decisions to establish a repeatable and defensible
process early will reduce the risk of downstream problems.


Predictive Coding Is Just One
Tool in the Litigator’s Toolbelt


Although the right predictive coding tools can reduce the 
time and cost of document review and improve accuracy 
rates, they’re not a substitute for other important technology 
tools. Keyword search, concept search, domain filtering, and 
discussion threading are only a few of the other important 
tools in the litigator’s toolbelt that can and should be used 
together with a predictive coding tool. Invest in an eDiscovery 
platform that contains a wide range of seamlessly integrated 
eDiscovery tools that work together to ensure the simplest, 
most flexible, and most efficient eDiscovery process. 


Using Predictive Coding Tools 
Properly Makes All the Difference 


eDiscovery tools, like most technology solutions, are only 
effective if used properly. Since many early-generation tools 
are difficult to use and understand, learning how to use those 
tools properly is critical to your eDiscovery success. To 
maximize your success and minimize the risk of problems, 
select trustworthy predictive coding tools supported by 
reputable solution providers and make sure to learn how to 
use the tool properly. 


Predictive Coding Isn't 
Just for Big Cases 


Sometimes predictive coding tools must be purchased 
separately from other eDiscovery tools or additional fees are 
required to use them. As a result, many practitioners only 
consider predictive coding for the largest cases to ensure the 
cost of eDiscovery doesn’t exceed the value of the case. If 
possible, invest in an eDiscovery solution that includes 
predictive coding as part of an integrated eDiscovery platform 
containing legal hold, collection, processing, culling, analysis, 
and review capabilities at no additional charge. Since the cost
of using different predictive coding tools varies dramatically,
make sure to select a tool at the right price point to maximize 
economic efficiencies across multiple cases regardless of size. 


Investigate the Solution 
Providers 


Not all predictive coding tools are created equal. The tools
vary significantly in price, usability, performance, and overall 
reputation. Although the availability of trustworthy and 
independent information comparing different predictive 
coding tools is limited, information about the companies 
behind these different tools is available. Make sure to review 
independent research from analysts such as Gartner Inc. as 
part of your vetting process instead of starting from scratch. 
Once your organization is serious about selecting an 
eDiscovery platform or predictive coding tool, make sure to 
follow the guidelines discussed in Chapter 7. 


Test Drive Before You Buy


Savvy eDiscovery technology investors take steps to ensure 
that the predictive coding tool they are considering works 
within their organization’s environment and on their 
organization’s data. Product demonstrations are important, 
but testing products internally through a proof of concept 
(see Chapter 7) is even more important if you are contemplating 
bringing an eDiscovery platform in house. Additionally, check 
company references before investing in a technology solution 
to find out how others feel about the solutions they purchased 
and the level of product support they receive. 


Defensibility Is Paramount


Although predictive coding tools can save organizations 
money through increased efficiency, the relative newness and 
complexity of the technology can create risk. To avoid this 
risk, choose a predictive coding tool that is easy to use,
developed by a trusted company, and fully supported.


Statistical Methodology and
Product Training Are Critical 


The underlying statistical methodology behind any predictive 
coding tool is critical to the defensibility of one’s eDiscovery 
process. Many providers fail to incorporate a product 
workflow for selecting a properly sized control set in certain 
situations (refer to Chapter 3). Unfortunately, this oversight 
could unwittingly result in misrepresentations to the court 
and opposing parties about the system’s performance. Select
providers that can explain the statistical methodology
behind their solution and can provide proper training
on the use of their system.
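

To see why control set size matters, consider the textbook formula for the number of randomly sampled documents needed to estimate a proportion within a chosen margin of error. The Python sketch below applies that formula. It is an assumption for illustration, not the specific workflow described in Chapter 3 or used by any provider, and collections with very few responsive documents may need larger samples to measure recall reliably.

```python
import math
from statistics import NormalDist
from typing import Optional


def sample_size(confidence: float, margin_of_error: float,
                population: Optional[int] = None,
                expected_rate: float = 0.5) -> int:
    """Documents to sample to estimate a proportion (for example, the
    share of responsive documents) within +/- margin_of_error.

    expected_rate=0.5 is the most conservative guess and gives the
    largest sample size.
    """
    # Two-sided z-score for the requested confidence level.
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    n = (z ** 2) * expected_rate * (1.0 - expected_rate) / (margin_of_error ** 2)
    if population is not None:
        # Finite population correction for smaller collections.
        n = n / (1.0 + (n - 1.0) / population)
    return math.ceil(n)


if __name__ == "__main__":
    # 95 percent confidence with a 2 percent margin of error.
    print(sample_size(0.95, 0.02))
```

Notice that tightening the margin of error from 5 percent to 2 percent raises the required sample from a few hundred documents to a few thousand, which is why a control set that is too small can overstate how well a system is performing.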


Transparency Is Key 


Chapter 6 explains why many practitioners are legitimately 
concerned that early-generation predictive coding solutions 
operate as a “black box,” meaning the way they work is 
difficult to understand. Since it is difficult to defend technology 
that is difficult to understand, selecting a solution and 
process that can be explained in court is critical. Make sure 
to choose a predictive coding solution that is transparent to 
prevent allegations by opponents that your tool is “black box”
technology that cannot be trusted. 


Align with Attorneys You Trust


The fact that predictive coding is relatively new to the legal 
field and can be more complex than traditional approaches 
to eDiscovery highlights the importance of aligning with the 
right legal counsel. Many attorneys defer legal technology 
decisions to others on their legal team and have little 
practical experience using these solutions themselves. 
Conversational knowledge about these tools isn’t enough 
given the confusion, complexity, and risk related to selecting 
the wrong tool or using the tools improperly. Make sure to 
align with an attorney who possesses hands-on experience 
and who is able to articulate specific reasons why they prefer 
a particular solution or approach.


Use predictive coding technology
to significantly reduce eDiscovery 
time and cost! 


Predictive coding technology is a new approach 
to attorney document review that can help 
organizations significantly reduce the time and 
cost of eDiscovery. Regardless of whether you're 
new to predictive coding or a seasoned veteran,
this book provides a wealth of information about a
wide variety of important issues that every legal 
team should understand. This book explains 
common predictive coding terminology and reveals 
the many benefits of the technology when used 
as part of a broader eDiscovery technology platform. 
You also uncover secrets for avoiding eDiscovery 
pitfalls and learn tips for establishing a legally 
defensible workflow for using predictive coding 
tools. Lastly, a list of valuable guidelines is provided 
to help you select the right predictive coding solution 
to fit the needs of your organization. 